Week 7: Spatial Linear Models

Helene Wagner and Yessica Rico

2018-03-12

1. Overview of Worked Example

a) Goals

This worked example shows:

b) Data set

Here we analyze population-level data of the wildflower Dianthus carthusianorum (common name: Carthusian pink) in 65 calcareous grassland patches in the Franconian Jura, Germany (Rico et al. 2013):

c) Required R libraries

All required packages should have been installed already when you installed ‘LandGenCourse’.

Note: the function ‘library’ will always load the package, even if it is already loaded, whereas ‘require’ will only load it if it is not yet loaded. Either will work.

require(LandGenCourse)
#require(here)
#require(spdep)
#require(nlme)
#require(lattice)
#require(MuMIn)
require(ggplot2)
source(system.file("extdata", "panel.cor.r", 
                            package = "LandGenCourse"))

Package ‘spmoran’ not automatically installed with ‘LandGenCourse’:

if(!require(spmoran)) install.packages("spmoran", repos='http://cran.us.r-project.org')
#require(spmoran)

2. Explore data set

We will model allelic richness ‘A’ as a function of the following predictors:

The connectivity indices Si were calculated for each focal patch i, integrating over all other patches j where the species was present (potential source patches) using Hanski’s incidence function.

a) Import data

data(Dianthus)

Allelic richness ‘A’ was not calculate for populations with < 5 individuals. Here we extract only the patches with ‘A’ values, and the variables needed, and store them in a data frame ‘Dianthus.df’.

b) Create a map

Note: the coordinates ‘x’ and ‘y’ in Dianthus@coords are Gauss-Krueger projection GK4 as defined in Dianthus@proj4string. In addition, ‘Longitude’ and ‘Latitude’ are available as variables in Dianthus@data. This is useful for functions like ‘qmplot’ that expect latlon coordinates.

ggmap::qmplot(x =  Longitude, y = Latitude, data = Dianthus@data,
              source = "google", maptype = "terrain", zoom = 12)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=48.986778,11.000013&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

As you can see from the map, most sites lie on the steep slopes between an upper and a lower Jurassic plateau. A few sites, south of road “S12216”, lie at the forest edge on the upper plateau, typically in areas where the soil is too shallow to allow crop farming. With in the study area, all known sites were sampled. Additional sites are expected to be found mainly in the valley system in the Southwest.

c) Explore correlations

When fitting linear models, it is always a good idea to look at the correlations first.

Dianthus.df <- data.frame(A=Dianthus@data$A, IBD=Dianthus@data$Eu_pj, 
                          IBR=Dianthus@data$Sheint_pj,
                          PatchSize=log(Dianthus@data$Ha),
                          Longitude=Dianthus@data$Longitude,
                          Latitude=Dianthus@data$Latitude,
                          x=Dianthus@coords[,1], y=Dianthus@coords[,2])

Dianthus.df <- Dianthus.df[!is.na(Dianthus.df$A),]
dim(Dianthus.df)
## [1] 59  8
pairs(Dianthus.df, lower.panel=panel.smooth, upper.panel=panel.cor,
      diag.panel=panel.hist)

Questions:

Let’s also check the association between patch size and population size:

boxplot(log(Dianthus$Ha) ~ Dianthus$pop09, ylab="PatchSize (log(Ha))",
        xlab="Population size category")

Even though the population size categories were very broad, there appears to be a strong relationship between populations size (category) and (the logarithm of) patch size.

Despite this relationship, connectivity models Si that only considered Dianthus carthusianorum presence/absence (‘pj’) in source patches ‘j’ were better supported than those Si models that took into account source patch area (‘Aj’) or population size (‘Nj’).

We can check this by calculating the correlation of allelelic richness ‘A’ with each of the 15 connectivity models ‘Si’ in the data set.

round(matrix(cor(Dianthus@data$A, Dianthus@data[,15:29], 
                 use="pairwise.complete.obs"), 5, 3, byrow=TRUE, 
           dimnames=list(c("Eu", "Shecte", "Sheint", "Shenu", "Forest"), 
                         c("pj", "Aj", "Nj"))),3)
##           pj     Aj    Nj
## Eu     0.024  0.207 0.098
## Shecte 0.375  0.273 0.369
## Sheint 0.403  0.196 0.369
## Shenu  0.135 -0.045 0.105
## Forest 0.137  0.168 0.138

3. Test regression residuals for spatial autocorrelation

a) Fit regression models

Here we fit three multiple regression models to explain variation in allelic richness:

mod.lm.IBD <- lm(A ~ IBD, data = Dianthus.df)
summary(mod.lm.IBD)
## 
## Call:
## lm(formula = A ~ IBD, data = Dianthus.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68545 -0.10220  0.03883  0.16178  0.36100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.070960   0.064061  63.549   <2e-16 ***
## IBD         0.008454   0.047460   0.178    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2324 on 57 degrees of freedom
## Multiple R-squared:  0.0005563,  Adjusted R-squared:  -0.01698 
## F-statistic: 0.03173 on 1 and 57 DF,  p-value: 0.8593

This model does not fit the data at all!

mod.lm.IBR <- lm(A ~ IBR, data = Dianthus.df)
summary(mod.lm.IBR)
## 
## Call:
## lm(formula = A ~ IBR, data = Dianthus.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66844 -0.11251  0.03418  0.12219  0.41760 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.92306    0.05499  71.348  < 2e-16 ***
## IBR          0.25515    0.07672   3.326  0.00155 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2128 on 57 degrees of freedom
## Multiple R-squared:  0.1625, Adjusted R-squared:  0.1478 
## F-statistic: 11.06 on 1 and 57 DF,  p-value: 0.001547

This model fits much better. Let’s check the residuals plots.

par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(mod.lm.IBR)

par(mfrow=c(1,1))

The residuals show some deviation from a normal distribution. Specifically, the lowest values are lower than expected.

mod.lm.PatchSize <- lm(A ~ PatchSize + IBR, data = Dianthus.df)
summary(mod.lm.PatchSize)
## 
## Call:
## lm(formula = A ~ PatchSize + IBR, data = Dianthus.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68359 -0.08844  0.02538  0.11705  0.41705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.05158    0.07781  52.069   <2e-16 ***
## PatchSize    0.04266    0.01888   2.260   0.0277 *  
## IBR          0.11338    0.09709   1.168   0.2479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2055 on 56 degrees of freedom
## Multiple R-squared:  0.2325, Adjusted R-squared:  0.2051 
## F-statistic: 8.482 on 2 and 56 DF,  p-value: 0.0006058

This combinde model explains more variation in allelic richness than the IBR model alone. Moreover, after adding PatchSizes, the IBR term is no longer statistically significant!

Has the distribution of residuals improved as well?

par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(mod.lm.PatchSize)

par(mfrow=c(1,1))

Not really!

b) Test for spatial autocorrelation (Moran’s I):

Before we interpret the models, let’s check whether the assumption of independent residuals is violated by spatial autocorrelation in the residuals.

To calculate and test Moran’s I, we first need to define neighbours and spatial weights. Here we use a Gabriel graph to define neighbours.

We define weights in three ways (see Week 5 video and tutorial for code explanation):

In each case, we row-standardize the weights with the option ‘style = “W”’.

Note: when using ‘graph2nb’, make sure to use the argument ‘sym=TRUE’. This means that if A is a neighbour of B, B is also a neighbour of A. The default is ‘sym=FALSE’, which may result in some sites not having any neighbours assigned (though this would not be evident from the figure!).

xy <- data.matrix(Dianthus.df[,c("x", "y")])
nb.gab <- spdep::graph2nb(spdep::gabrielneigh(xy), sym=TRUE)
par(mar=c(0,0,0,0))
plot(nb.gab, xy)

listw.gab <- spdep::nb2listw(nb.gab)

dlist <- spdep::nbdists(nb.gab, xy)
dlist <- lapply(dlist, function(x) 1/x)
listw.d1 <- spdep::nb2listw(nb.gab, style = "W", glist=dlist)
dlist <- lapply(dlist, function(x) 1/x^2)
listw.d2 <- spdep::nb2listw(nb.gab, style = "W", glist=dlist)

Now we can quantify and test Moran’s I for each variable to test for spatial autocorrelation in response and predictor variables. For now, we’ll take the simple weights ‘listw.gab’.

spdep::moran.test(Dianthus.df$A, listw.gab)             
## 
##  Moran I test under randomisation
## 
## data:  Dianthus.df$A  
## weights: listw.gab  
## 
## Moran I statistic standard deviate = 2.881, p-value = 0.001982
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.29180913       -0.01724138        0.01150691
spdep::moran.test(Dianthus.df$IBD, listw.gab)
## 
##  Moran I test under randomisation
## 
## data:  Dianthus.df$IBD  
## weights: listw.gab  
## 
## Moran I statistic standard deviate = 5.9689, p-value = 1.194e-09
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.62709841       -0.01724138        0.01165320
spdep::moran.test(Dianthus.df$IBR, listw.gab) 
## 
##  Moran I test under randomisation
## 
## data:  Dianthus.df$IBR  
## weights: listw.gab  
## 
## Moran I statistic standard deviate = 3.008, p-value = 0.001315
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.30918167       -0.01724138        0.01177588
spdep::moran.test(Dianthus.df$PatchSize, listw.gab) 
## 
##  Moran I test under randomisation
## 
## data:  Dianthus.df$PatchSize  
## weights: listw.gab  
## 
## Moran I statistic standard deviate = 3.4257, p-value = 0.0003066
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.35490160       -0.01724138        0.01180103

Questions:

Next, let’s test each model for autocorrelation in the residuals:

spdep::lm.morantest(mod.lm.IBD, listw.gab) 
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = A ~ IBD, data = Dianthus.df)
## weights: listw.gab
## 
## Moran I statistic standard deviate = 2.9439, p-value = 0.00162
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.28889983      -0.02854559       0.01162751
spdep::lm.morantest(mod.lm.IBR, listw.gab)          
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = A ~ IBR, data = Dianthus.df)
## weights: listw.gab
## 
## Moran I statistic standard deviate = 1.883, p-value = 0.02985
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.18032443      -0.02296810       0.01165614
spdep::lm.morantest(mod.lm.PatchSize, listw.gab)       
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = A ~ PatchSize + IBR, data = Dianthus.df)
## weights: listw.gab
## 
## Moran I statistic standard deviate = 1.7484, p-value = 0.04019
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##       0.16223745      -0.02708922       0.01172530

Quite a bit of the spatial autocorrelation in allelic richness can be explained by the spatial structure in the predictors IBR and PatchSize. There is still statistically significant spatial autocorrelation in the residuals, though it is not strong any more.

4. Fit models with spatially correlated error (GLS) with package ‘nlme’

One way to account for spatial autocorrelation in the residuals is to fit a Generalized Least Squares model (GLS) with a spatially autocorrelated error structure. See also: http://rfunctions.blogspot.ca/2017/06/how-to-identify-and-remove-spatial.html

a) Plot empirical variogram

The error structure in a GLS is defined in a geostatistical framework, based on a variogram and as a function of distance between observations. Hence we start with plotting an empirical variogram of the residuals, with a smooth line. Here we specify ’resType = “normalized”, which means that the variogram will be fitted to the normalized residuals of the model.

The expected value of the semivariance will be 1. Hence it would make sense to add a horizontal line at 1. However, this is cumbersome with the trellis graphics (using package ‘lattice’) used by ‘nlme’.

model.lm <- nlme::gls(A ~ IBR + PatchSize, data = Dianthus.df)
summary(model.lm)
## Generalized least squares fit by REML
##   Model: A ~ IBR + PatchSize 
##   Data: Dianthus.df 
##         AIC      BIC   logLik
##   0.5962264 8.697633 3.701887
## 
## Coefficients:
##                Value  Std.Error  t-value p-value
## (Intercept) 4.051575 0.07781098 52.06945  0.0000
## IBR         0.113378 0.09709132  1.16775  0.2479
## PatchSize   0.042664 0.01887968  2.25979  0.0277
## 
##  Correlation: 
##           (Intr) IBR   
## IBR       -0.922       
## PatchSize  0.731 -0.646
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -3.3264268 -0.4303604  0.1235065  0.5695885  2.0294197 
## 
## Residual standard error: 0.2055025 
## Degrees of freedom: 59 total; 56 residual
semivario <- nlme::Variogram(model.lm, form = ~x  + y, resType = "normalized")

If you want to create your own figure, e.g. with ‘ggplot2’, you can access the values stores in the data frame ‘semivario’ to plot the points, and add a smooth line yourself. Then we can add a horizontal line with ‘geom_hline’.

ggplot(data=semivario, aes(x=dist, y=variog)) + geom_point() + geom_smooth(se=FALSE) +
  geom_hline(yintercept=1) + ylim(c(0,1.3)) + xlab("Distance") + ylab("Semivariance")
## `geom_smooth()` using method = 'loess'

Question:

b) Fit variogram models

We can ask R to fit different types of variogram models to this empirical variogram. The model family (e.g., exponential, gaussian, spherical) determines the general shape of the curve that will be fitted. With ‘nugget=T’, we indicate that a nugget effect should be fitted.

mod.corExp <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, 
                            correlation = nlme::corExp(form = ~ x + y, nugget=T))

mod.corGaus <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, 
                            correlation = nlme::corGaus(form = ~ x + y, nugget=T))

mod.corSpher <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, 
                            correlation = nlme::corSpher(form = ~ x + y, nugget=T))

#mod.corLin <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, 
#                            correlation = nlme::corLin(form = ~ x + y, nugget=T))

mod.corRatio <- nlme::gls( A ~ PatchSize + IBR, data = Dianthus.df, 
                            correlation = nlme::corRatio(form = ~ x + y, nugget=T))

c) Select best-fitting model

Now we compare all models for which we did not get an error message:

MuMIn::model.sel(model.lm, mod.corExp, mod.corGaus, 
                 mod.corSpher, mod.corRatio)     
## Model selection table 
##              (Intrc)     IBR   PtchS  correlation df logLik AICc delta
## mod.corExp     4.044 0.08468 0.04269 n::cE(x+y,T)  6  6.283  1.1  0.00
## mod.corRatio   4.032 0.09704 0.04129 n::cR(x+y,T)  6  6.229  1.2  0.11
## model.lm       4.052 0.11340 0.04266               4  3.702  1.3  0.29
## mod.corGaus    4.047 0.11840 0.04247 n::cG(x+y,T)  6  3.763  6.1  5.04
## mod.corSpher   4.048 0.11710 0.04242 n::cS(x+y,T)  6  3.760  6.1  5.04
##              weight
## mod.corExp    0.336
## mod.corRatio  0.319
## model.lm      0.291
## mod.corGaus   0.027
## mod.corSpher  0.027
## Abbreviations:
## correlation: n::cE(x+y,T) = 'nlme::corExp(~x+y,T)', 
##              n::cG(x+y,T) = 'nlme::corGaus(~x+y,T)', 
##              n::cR(x+y,T) = 'nlme::corRatio(~x+y,T)', 
##              n::cS(x+y,T) = 'nlme::corSpher(~x+y,T)'
## Models ranked by AICc(x)

The list sorts the models, with the best model on top. The last column ‘weight’ contains the model weight, which indicate how much support there is for each model, given all other models in the set (see Week 12). Here, the exponential model fitted best, though the ratio model and the model without a spatially correlated error structure fitted the data almost equally well.

summary(mod.corExp)  
## Generalized least squares fit by REML
##   Model: A ~ PatchSize + IBR 
##   Data: Dianthus.df 
##         AIC     BIC   logLik
##   -0.565214 11.5869 6.282607
## 
## Correlation Structure: Exponential spatial correlation
##  Formula: ~x + y 
##  Parameter estimate(s):
##        range       nugget 
## 2706.6744822    0.5817938 
## 
## Coefficients:
##                Value  Std.Error  t-value p-value
## (Intercept) 4.044148 0.09311469 43.43191  0.0000
## PatchSize   0.042689 0.01809960  2.35855  0.0219
## IBR         0.084681 0.09127713  0.92774  0.3575
## 
##  Correlation: 
##           (Intr) PtchSz
## PatchSize  0.517       
## IBR       -0.654 -0.603
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -3.0122222 -0.2556758  0.2457265  0.5944553  1.9942337 
## 
## Residual standard error: 0.2190384 
## Degrees of freedom: 59 total; 56 residual

The fitted model with the exponential error structure shows a significant effect for PatchSize but not for the IBR term.

We don’t get an R-squared value directly, but we can calculate a pseudo R-squared from a regression of the response ‘A’ on the fitted values.

summary(lm(A ~ fitted(mod.corExp), data = Dianthus.df))$r.squared
## [1] 0.2317629
summary(mod.lm.PatchSize)$r.squared
## [1] 0.2324922

The pseudo R-squared is almost identical to the R-squared of the non-spatial ‘lm’ model.

Let’s check some residual plots - in this case, we have to construct them ourselves.

par(mfrow=c(1,2), mar=c(4,4,1,1))
plot(fitted(mod.corExp), residuals(mod.corExp))
abline(h=0,lty=3)

qqnorm(residuals(mod.corExp), main="")
qqline(residuals(mod.corExp))

The normal probability plot still looks about the same.

semivario <- nlme::Variogram(mod.corExp, form = ~ x + y, 
                             resType = "normalized")
plot(semivario, smooth = TRUE)

The variogam does look better!

d) Plot fitted variogram model

How can we plot the fitted variogram? Let’s first store it in an object ‘Fitted.variog’, then plot it.

Fitted.variog <- nlme::Variogram(mod.corExp)
class(Fitted.variog)
## [1] "Variogram"  "data.frame"
plot(Fitted.variog)

That was easy. If we want to access the fitted values (i.e. the exponential model curve values), this is a bit more involved.

The object ‘Fitted.variog’ has two classes: “Variogram” and “data.frame”. Technically, it is a data frame (S3) with additional attributes. This raises a challenge, because we access attributes of S3 objects with ‘\(', but we also use '\)’ to access columns in a data frame.

If we just print ‘Fitted.variog’, we only see the data frame.

head(Fitted.variog)
##      variog      dist n.pairs
## 1 0.6782704  620.5202      85
## 2 0.6112605 1236.8297      86
## 3 0.7415494 1827.3367      85
## 4 0.5751311 2269.6729      86
## 5 1.0503017 2691.2494      85
## 6 0.7493939 3074.7370      86

We can see the attributes listed by using ‘str’:

str(Fitted.variog)
## Classes 'Variogram' and 'data.frame':    20 obs. of  3 variables:
##  $ variog : num  0.678 0.611 0.742 0.575 1.05 ...
##  $ dist   : num  621 1237 1827 2270 2691 ...
##  $ n.pairs: int [1:20(1d)] 85 86 85 86 85 86 85 86 85 86 ...
##   ..- attr(*, "dimnames")=List of 1
##   .. ..$ : chr  "(106,971]" "(971,1.56e+03]" "(1.56e+03,2.02e+03]" "(2.02e+03,2.5e+03]" ...
##  - attr(*, "modelVariog")=Classes 'Variogram' and 'data.frame':  50 obs. of  2 variables:
##   ..$ variog: num  0.598 0.637 0.673 0.705 0.734 ...
##   ..$ dist  : num  106 385 665 945 1224 ...
##  - attr(*, "collapse")= logi TRUE

The line we are looking for is: ’- attr(*, “modelVariog”)‘. The attributes ’modelVariog’ has 50 rows (obs.) and 2 variables: ‘\(variog' and '\)dist’. These are the fitted values (i.e., values of the exponential variogram model for 50 distance values).

The notation attr(*, “modelVariog”) is a cryptic way of telling us how to access the attribute: use the function ‘attr’ and provide two arguments: the object names ‘Fitted.variog’ (represented by the asterisk), and the name of the attribute, in quotes: ‘attr(Fitted.variog, “modelVariog”)’.

tibble::as.tibble(attr(Fitted.variog, "modelVariog"))
## # A tibble: 50 x 2
##    variog  dist
##     <dbl> <dbl>
##  1  0.598   106
##  2  0.637   385
##  3  0.673   665
##  4  0.705   945
##  5  0.734  1224
##  6  0.760  1504
##  7  0.784  1783
##  8  0.805  2063
##  9  0.824  2342
## 10  0.841  2622
## # ... with 40 more rows

This is useful to know if you want to create your own figures, e.g. with ‘ggplot2’.

ggplot(data=Fitted.variog, aes(x=dist, y=variog)) + geom_point() + 
  ylim(c(0,1.3)) + xlab("Distance") + ylab("Semivariance") + 
  geom_line(data=attr(Fitted.variog, "modelVariog"), aes(x=dist, y=variog), color="blue")

5. Fit spatial simultaneous autoregressive error models (SAR)

An alternative way to account for spatial autocorrelation in the residuals is spatial regression with a simultaneous autoregressive error model (SAR).

a) Fit and compare alternative SAR models

The method ‘errorsarlm’ fits a simultaneous autoregressive model (‘sar’) to the error (‘error’) term of a ‘lm’ model.

This approach is based on spatial neighbours and weights. We have already defined them in three versions of a ‘listw’ object. Let’s see which one fits the data best.

mod.sar.IBR.gab <- spdep::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.gab)
mod.sar.IBR.d1 <- spdep::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.d1)
mod.sar.IBR.d2 <- spdep::errorsarlm(A ~ PatchSize + IBR, data = Dianthus.df, 
                                 listw = listw.d2)

MuMIn::model.sel(mod.lm.IBR, mod.sar.IBR.gab, mod.sar.IBR.d1, mod.sar.IBR.d2) 
## Model selection table 
##                 (Intrc)     IBR   PtchS family class listw df logLik  AICc
## mod.sar.IBR.d1    4.067 0.09058 0.04142     () sarlm ls.d1  5 12.500 -13.9
## mod.sar.IBR.gab   4.063 0.09458 0.04081     () sarlm ls.gb  5 12.215 -13.3
## mod.sar.IBR.d2    4.055 0.10650 0.04133     () sarlm ls.d2  5 11.549 -12.0
## mod.lm.IBR        3.923 0.25520          gassn    lm        3  8.603 -10.8
##                 delta weight
## mod.sar.IBR.d1   0.00  0.425
## mod.sar.IBR.gab  0.57  0.320
## mod.sar.IBR.d2   1.90  0.164
## mod.lm.IBR       3.10  0.090
## Abbreviations:
## family: gassn = 'gaussian'
## listw: ls.d1 = 'listw.d1', ls.d2 = 'listw.d2', ls.gb = 'listw.gab'
## Models ranked by AICc(x)

The best model (‘mod.sar.IBR.d1’) is the one with (row-standardized) inverse-distance weights (‘listw.d1’). It is only slightly better than the model with the (row-standardized) binary weights (‘listw.gab’), whereas the nonspatial model and the one with (row-standardized) inverse squared distance weights have much less support.

b) Interpret best-fitting SAR model

Let’s have a look at the best model. With the argument ‘Nagelkerke = TRUE’, we request a pseudo R-squared.

summary(mod.sar.IBR.d1, Nagelkerke = TRUE)
## 
## Call:spdep::errorsarlm(formula = A ~ PatchSize + IBR, data = Dianthus.df, 
##     listw = listw.d1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.654613 -0.085961  0.016066  0.091687  0.389419 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.067066   0.077809 52.2696   <2e-16
## PatchSize   0.041416   0.018554  2.2322   0.0256
## IBR         0.090578   0.094946  0.9540   0.3401
## 
## Lambda: 0.22322, LR test value: 2.6442, p-value: 0.10393
## Asymptotic standard error: 0.13659
##     z-value: 1.6343, p-value: 0.1022
## Wald statistic: 2.6708, p-value: 0.1022
## 
## Log likelihood: 12.49972 for error model
## ML residual variance (sigma squared): 0.037595, (sigma: 0.19389)
## Nagelkerke pseudo-R-squared: 0.26613 
## Number of observations: 59 
## Number of parameters estimated: 5 
## AIC: -14.999, (AIC for lm: -14.355)

6. Spatial filtering with MEM using package ‘spmoran’

See tutorial for ‘spmoran’ package: https://arxiv.org/ftp/arxiv/papers/1703/1703.04467.pdf

Both GLS and SAR fitted a spatially correlated error structure of a relatively simple form to the data. Gene flow could be more complex and for example, could create spatial autocorrelation structure that is not the same in all directions or in all parts of the study area. Moran Eigenvector Maps (MEM) allows a more flexible modeling of spatial structure in the data. In spatial filtering, we use MEM spatial eigenvectors to account for any spatial structure while fitting and testing the effect of our predictors.

a) Default method

The new package ‘spmoran’ makes this really easy. First, we create the MEM spatial eigenvectors. This implies defining neighbors and weights, but this is well hidden in the code below. The function ‘meigen’ here takes the coordinates, calculates a minimum spanning tree (so that each site has at least one neighbour), and finds the maximum distance ‘h’ from the spanning tree. It then calculates neighbor weights as exp(-dij / h).

Note: if you have many sites (> 200), the function ‘meigen_f’ may be used instead of ‘meigen’, it should even work for >1000 sites.

The function ‘esf’ then performs the spatial filtering. Here it uses stepwise selection of MEM spatial eigenvectors using an R-squared criterion (fn = “r2”).

# lm model: using truncated distance matrix (max of min spanning tree distance)
meig <- spmoran::meigen(coords=xy)
##  9/59 eigen-pairs are extracted
sfd.res <- spmoran::esf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")],
                       meig=meig, fn = "r2" )
##   5/9 eigenvectors are selected

The objects created by functions ‘meigen’ and ‘esf’ contain a lot of information:

Let’s look at the table ‘b’ with regression results for the predictors first:

sfd.res$b
##               Estimate         SE    t_value      p_value
## (Intercept) 4.08678879 0.06759658 60.4585130 3.838659e-49
## PatchSize   0.03642554 0.01673612  2.1764625 3.417603e-02
## IBR         0.04687286 0.08602602  0.5448685 5.882188e-01

Again, PatchSize is statistically significant but not IBR.

Next, we look at the table ‘r’ with regression results for MEM spatial eigenvectors:

sfd.res$r
##       Estimate        SE   t_value     p_value
## sf6  0.5938646 0.1782765  3.331144 0.001613967
## sf1  0.3962440 0.1788408  2.215624 0.031207615
## sf7 -0.4065267 0.1895836 -2.144313 0.036795842
## sf3 -0.3474148 0.1760289 -1.973624 0.053854181
## sf5 -0.3290397 0.1757533 -1.872168 0.066922529

Five MEM spatial eigenvectors were important enough to be included in the model. Here they are ranked by their (absolute value of) slope coefficient, and thus by the strength of their association with the response variable. Eigenvector ‘sf6’ was by far the most important.

Note: some eigenvectors are included despite having a p-value > 0.05. This may have two reasons. First, the eigenvectors were selected without taking into account predictors X. Second, a different test was used in the stepwise eigenvector selection. The type of test can be specified with an argument ‘fn’ (see ‘?esf’ helpfile and ‘spmoran’ tutorial).

Finally, let’s look at the summary results for the fitted model:

sfd.res$e
##                 stat
## resid_SE   0.1728755
## adjR2      0.4374576
## logLik    24.1369653
## AIC      -30.2739307
## BIC      -11.5760937

Here, ‘adjR2’ is rather high (0.437), but this includes the selected MEM spatial eigenvectors!

b) Using a custom connectivity matrix

We know already that ‘listw.d1’ fit the data well, so let’s re-run the model with our own definition of spatial weights. With the funciton ‘listw2mat’, we convert from ‘listw’ format to a full connnectivity matrix.

cmat.d1    <- spdep::listw2mat( listw.d1) 
meigw  <- spmoran::meigen( cmat = cmat.d1 )
##  Note:
##    cmat is symmetrized by ( cmat + t( cmat ) ) / 2
##  27/59 eigen-pairs are extracted
sfw.res <- spmoran::esf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")],
                       meig=meigw, fn = "r2" )
##   15/27 eigenvectors are selected
sfw.res$b
##               Estimate         SE    t_value      p_value
## (Intercept) 4.07534252 0.07375191 55.2574522 3.989818e-40
## PatchSize   0.03114008 0.01772391  1.7569529 8.639209e-02
## IBR         0.05720775 0.09657088  0.5923913 5.568418e-01
tibble::as.tibble(sfw.res$r)
## # A tibble: 15 x 4
##    Estimate    SE t_value  p_value
##  *    <dbl> <dbl>   <dbl>    <dbl>
##  1    0.666 0.171   3.90  0.000348
##  2   -0.404 0.172  -2.34  0.0240  
##  3   -0.509 0.160  -3.19  0.00273 
##  4   -0.424 0.171  -2.48  0.0175  
##  5    0.358 0.160   2.24  0.0307  
##  6   -0.314 0.167  -1.88  0.0668  
##  7    0.284 0.176   1.62  0.113   
##  8   -0.185 0.156  -1.18  0.243   
##  9   -0.160 0.164  -0.973 0.336   
## 10   -0.244 0.158  -1.54  0.131   
## 11    0.260 0.163   1.60  0.117   
## 12    0.358 0.184   1.94  0.0588  
## 13    0.332 0.177   1.88  0.0679  
## 14   -0.237 0.162  -1.47  0.151   
## 15    0.215 0.159   1.35  0.185
sfw.res$e
##                 stat
## resid_SE   0.1520845
## adjR2      0.5646300
## logLik    38.1354212
## AIC      -38.2708424
## BIC        1.2023691

Note: the messages tell us that ‘cmat’ has been made symmetric before analysis, that 27 out of 59 MEM spatial eigenvector (and their eigenvalues, hence ‘pairs’) were retained initially and subjected to stepwise selection, which then returned 15 statistically significant MEM eigenvectors that were included in the regression model with the predictor variables X (PatchSize and IBR).

Questions:

c) Plot spatial eigenvectors

So far, we have treated the MEM spatial eigenvectors as a black box. What kind of patterns do they represent?

First, we plot all the selected (significant) eigenvectors. A convenient way to do so is converting to a SpatialPointsDataFrame and then use the function ‘spplot’.

Here we need tweak the column names of the eigenvectors, which are called “X1”, “X2” etc., to match the row names of table ‘sf.res$r’, which uses “sf1”, “sf2” etc.

SF <- data.frame(xy, sf=meigw$sf)
names(SF) <- gsub("[.]", "", names(SF))
sp::coordinates(SF) <- ~ x + y
sp::spplot(SF, row.names(sfw.res$r), colorkey = TRUE)

The panel starts at the bottom left, so that the most important spatial eigenvector (sf6) is plotted at the bottom left, the second most important (sf22) second from left, etc.

The smallest numbers are patterns with the largest spatial scale (sf1), which here shows a gradient from East to West. The most important eigenvector (sf6) shows a finer-scale pattern with the highest values (yellow) in the center, lowest values East and West, and intermediate value North and South.

However, these patterns individually are not meaningful. More importantly, we can plot the total spatial component in the response as a weighted sum of these component patterns, where the weights correspond to the regression coefficients in table ‘sf.res$r’. Here we create a panel with two plots, the modeled spatial components ‘sf’ on the left and the response ‘A’ on the right (the mean has been removed to make values comparable).

SF@data$sf <- sfw.res$sf
SF@data$A <- scale(Dianthus.df$A, scale = FALSE)
sp::spplot(SF, c("sf", "A"), colorkey = TRUE)

Obviously, a big part of the variation in allelic richness is already captured by ‘sf’. In essence the model then tries to explain the difference between these two sets of values by the predictors “PatchSize” and “IBR”.

Let’s quantify the correlation of this spatial component with allelic richness, and compare the correlation between the two models:

cor(Dianthus.df$A, data.frame(sfd=sfd.res$sf, sfw=sfw.res$sf))
##            sfd       sfw
## [1,] 0.6252594 0.7720016

With the default method (defining neighbors based on a distance cut-off), the spatial component modeled by the significant MEM spatial eigenvectors showed a correlation of 0.625 with the response variable. Using a Gabriel graph with inverse distance weights increased this correlation to 0.772.

This means that the spatial eigenvectors derived from the Gabriel graph were more effective at capturing the spatial variation in allelic richness than the default method. This spatial component is then controlled for when assessing the relationship between allelic richness and the predictors (PatchSize and IBR).

d) Random effect model

The previous model selected 15 MEM spatial eigenvectors, and thus fitted 15 additional models. Just like the random effects for family and population in Week 6 lab, we can save a few parameters here by fitting the set of MEM eigenvectors as a random effect. This is done by the function ‘resf’.

sfr.res <- spmoran::resf( y=Dianthus.df$A, x=Dianthus.df[,c("PatchSize", "IBR")], 
               meig = meigw, method = "reml" ) 
##  RE-ESF model fit by REML
sfr.res$b
##               Estimate         SE   t_value    p_value
## (Intercept) 4.05889381 0.06856506 59.197696 0.00000000
## PatchSize   0.04215982 0.01675185  2.516726 0.01531481
## IBR         0.10077818 0.08727232  1.154755 0.25402637
tibble::as.tibble(sfr.res$r)
## # A tibble: 27 x 1
##    b...1.nx..
##  *      <dbl>
##  1     0.164 
##  2    -0.0408
##  3     0.0324
##  4    -0.0999
##  5     0.118 
##  6     0.243 
##  7     0.0281
##  8     0.0691
##  9    -0.174 
## 10    -0.0201
## # ... with 17 more rows
sfr.res$e
##                   stat
## resid_SE     0.1636501
## adjR2(cond)  0.4464726
## rlogLik      4.7082457
## AIC          4.5835085
## BIC         19.1262706
sfr.res$s
##                       par
## shrink_sf_SE    0.1739337
## shrink_sf_alpha 0.5941833

As in Week 6 lab, the conditional R-squared is the variance explained by the fixed effects (PatchSize and IBR) and the random effects (significant MEM spatial eigenvectors) together. It is adjusted for the number of effects that were estimated.

Note: we can’t compare AIC with the previous models, as the model was fitted with ‘reml’.

We get an additional output ‘sfr.res$s’ with two parameters:

7. Fit spatially varying coefficients model with package ‘spmoran’

See: https://arxiv.org/ftp/arxiv/papers/1703/1703.04467.pdf

Now comes the coolest part!

So far, we have fitted the same model for all sites. Geographically weighted regression (GWR) would allow relaxing this. Spatial filtering with MEM can be used to accomplish the same goal, and the ‘spmoran’ tutorial calls this a ‘Spatially Varying Coefficients’ model (SVC). The main advantage is that we can visualize how the slope parameter estimates, and their p-values, vary across the study area! This is a great exploratory tool that can help us better understand what is going on.

Model with PatchSize and IBR

We fit the model with ‘resf_vc’.

rv_res <- spmoran::resf_vc( y=Dianthus.df$A, 
                            x = Dianthus.df[,c("PatchSize", "IBR")], 
                            xconst = NULL, meig = meigw, method = "reml" )
##  RE-ESF model fit by REML

Instead of one slope estimate for each predictor, we now get a different estimate for each combination of parameter and site (sounds like overfitting?). Here’s a summary of the distbiution of these estimates.

summary( rv_res$b_vc ) 
##   (Intercept)      PatchSize              IBR           
##  Min.   :4.125   Min.   :-0.005689   Min.   :-0.805122  
##  1st Qu.:4.125   1st Qu.: 0.029205   1st Qu.:-0.087035  
##  Median :4.125   Median : 0.044763   Median : 0.042922  
##  Mean   :4.125   Mean   : 0.046007   Mean   :-0.009952  
##  3rd Qu.:4.125   3rd Qu.: 0.060693   3rd Qu.: 0.127195  
##  Max.   :4.125   Max.   : 0.098180   Max.   : 0.347704

The slope estimate for PatchSize varied between -0.0056 and 0.098, with a mean of 0.046. The slope estimate for the ‘IBR’ term varied between -0.805 and 0.348, with a mean of -0.01! That is an astounding range of variation. Keep in mind that we really expect a positive relationship, there is no biological explanation for a ngeative relationship.

Here is a similar summary of the p-values:

summary( rv_res$p_vc )
##   (Intercept)   PatchSize              IBR           
##  Min.   :0    Min.   :0.0000025   Min.   :0.0000013  
##  1st Qu.:0    1st Qu.:0.1033758   1st Qu.:0.0781595  
##  Median :0    Median :0.2269456   Median :0.1648269  
##  Mean   :0    Mean   :0.3014071   Mean   :0.3055278  
##  3rd Qu.:0    3rd Qu.:0.4695559   3rd Qu.:0.6244781  
##  Max.   :0    Max.   :0.8774384   Max.   :0.8993203

For both variables, most sites do not show a significant effect (i.e., only few sites show a p-value < 0.05).

We could print these results by site (type ‘rv_res\(b_vc' or 'rv_res\)p_vc’). Even better, we can plot them in space. We start with combining the data (‘Dianthus.df’) and the results into one data frame ‘Results’. By specifying ‘b=rv_res\(b_vc' and "p=rv_res\)p_vc’, R will create column names that start with ‘b’ or ‘p’, respectively.

Result <- data.frame(Dianthus.df, b=rv_res$b_vc, p=rv_res$p_vc)
names(Result)
##  [1] "A"             "IBD"           "IBR"           "PatchSize"    
##  [5] "Longitude"     "Latitude"      "x"             "y"            
##  [9] "b..Intercept." "b.PatchSize"   "b.IBR"         "p..Intercept."
## [13] "p.PatchSize"   "p.IBR"

Let’s start with PatchSize. Here, we first plot PatchSize in space, with symbol size as a function of patch size. In a second plot, we color sites by statistical significance and the size of the symbols represents the parameter estimate of the regression slope coefficient for Patch Size. The layer ‘coord_fixed’ keeps controls the aspect ratio between x- and y-axes.

require(ggplot2)
ggplot(as.data.frame(Result), aes(x, y, size=PatchSize)) +
  geom_point(color="darkblue") + coord_fixed()
ggplot(as.data.frame(Result), aes(x, y, col=p.PatchSize < 0.05, size=b.PatchSize)) +
  geom_point() + coord_fixed()

Let’s do the same for ‘IBR’:

require(ggplot2)
ggplot(as.data.frame(Result), aes(x, y, size=IBR)) +
  geom_point(color="darkgreen") + coord_fixed()
ggplot(as.data.frame(Result), aes(x, y, col=p.IBR < 0.05, size=b.IBR)) +
  geom_point() + coord_fixed()

Model with IBR only

Keep in mind that ‘IBR’ and ‘PatchSize’ showed a strong correlation. The parameter estimates could therefore depend quite a bit on the other variables. To help with the interpretation, let’s repeat the last analysis just with ‘IBR’, without ‘PatchSize’.

rv_res <- spmoran::resf_vc( y=Dianthus.df$A, 
                            x = Dianthus.df[,c("IBR")], 
                            xconst = NULL, meig = meigw, method = "reml" )
##  RE-ESF model fit by REML
summary( rv_res$b_vc ) 
##   (Intercept)          V1         
##  Min.   :3.801   Min.   :-0.2020  
##  1st Qu.:3.919   1st Qu.: 0.1545  
##  Median :3.941   Median : 0.2549  
##  Mean   :3.941   Mean   : 0.2101  
##  3rd Qu.:3.969   3rd Qu.: 0.3113  
##  Max.   :4.060   Max.   : 0.4180

Now the range of slope estimates is smaller, most sites have a positive estimate.

summary( rv_res$p_vc ) 
##   (Intercept)       V1           
##  Min.   :0    Min.   :0.0000002  
##  1st Qu.:0    1st Qu.:0.0023091  
##  Median :0    Median :0.0341999  
##  Mean   :0    Mean   :0.1680300  
##  3rd Qu.:0    3rd Qu.:0.1952481  
##  Max.   :0    Max.   :0.9620520

Also, a larger proportion of sites nows has p-values < 0.05.

Let’s plot the results onto a gray-scale (‘mapcolor = “bw”), Google terrain map (’source = “google”, maptype = “terrain”) to facilitate interpretation. Note: here the zoom level ’zoom = 12’ covers the entire study area, whereas the default value would actually cut off a large number of sites.

Result <- data.frame(Dianthus.df, b=rv_res$b_vc, p=rv_res$p_vc, resid=rv_res$resid)
names(Result)
##  [1] "A"             "IBD"           "IBR"           "PatchSize"    
##  [5] "Longitude"     "Latitude"      "x"             "y"            
##  [9] "b..Intercept." "b.V1"          "p..Intercept." "p.V1"         
## [13] "resid"
ggmap::qmplot(x=Longitude, y=Latitude, data=Result,
              source = "google", maptype = "terrain", zoom = 12,
              col=p.V1 < 0.05, size=b.V1, mapcolor = "bw")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=48.986778,11.000013&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead

This is a very different map of results!

8. Conclusions